A High Performance FPGA-Based Accelerator for BLAS Library Implementation
Authors
Abstract
This paper describes the implementation and performance analysis of a hardware accelerator for the matrix multiplication operation of the BLAS library. The accelerator is based on a dual-FPGA board and on an implementation of the BLAS software library that makes use of the FPGA-based hardware. To evaluate the performance of such a system, we implemented the matrix multiplication operation (the BLAS “dgemm” function) using an optimized FPGA matrix multiplication design, and we implemented the software “dgemm()” function so that it uses the FPGA-based board in a way that is completely transparent to the user. In contrast with other works [2,5,6,10], the measured performance is based on the global runtime of the FPGA-accelerated “dgemm” function at the software level, taking into account the data transfers between the host computer and the FPGA board as well as the software pre- and post-processing. We show that, using the developed FPGA-based BLAS accelerator, it is possible to achieve 60% higher performance than a fully software implementation running on a high-end computer. Through a detailed analysis, this paper also shows that the most limiting factors are the data transfers between the host computer memory and the FPGA board memory, and the data transfers between this memory and the FPGA itself.
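As a rough illustration of the two ideas in the abstract, the sketch below gives a plain reference for what “dgemm” computes (C ← alpha·A·B + beta·C, without transposition options) and a way to express performance over the whole call rather than the kernel alone. It is not the authors' code: the function names, the split of the runtime into five phases, and the conventional 2·m·n·k operation count are assumptions made for this example.

```python
def dgemm(alpha, a, b, beta, c):
    """Reference C <- alpha*A*B + beta*C on lists of rows.

    Assumes non-empty matrices and no transposition (illustrative only).
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("A must be m x k")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must be m x n")
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


def effective_gflops(m, n, k, t_pre, t_to_board, t_compute, t_from_board, t_post):
    """GFLOP/s over the global runtime of one call (all times in seconds).

    The five phases are a hypothetical breakdown: software pre-processing,
    host-to-board transfer, computation, board-to-host transfer, and
    software post-processing. 2*m*n*k is the usual operation count.
    """
    total = t_pre + t_to_board + t_compute + t_from_board + t_post
    return 2.0 * m * n * k / total / 1e9
```

With this accounting, a kernel that would reach a given rate in isolation reports half that rate once transfers and processing take as long as the computation itself, which is the kind of effect the abstract attributes to data movement.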
Similar resources
FPGA accelerator for floating-point matrix multiplication
This study treats the architecture and implementation of an FPGA accelerator for double-precision floating-point matrix multiplication. The architecture is oriented towards minimising resource utilisation and maximising clock frequency. It employs the block matrix multiplication algorithm which returns the result blocks to the host processor as soon as they are computed. This avoids output buffering...
Field Programmable Gate Array Implementation of Active Control Laws for Multi-mode Vibration Damping
This paper investigates the possibility and effectiveness of multi-mode vibration control of a plate through real-time FPGA (Field Programmable Gate Array) implementation. This type of embedded system offers true parallel and high throughput computation abilities. The control object is an aluminum panel, clamped to a Perspex box’s upper side. Two types of control laws are studied. The first belo...
The Implementation of BLAS level 3 on the AP 1000: Preliminary Report
The Basic Linear Algebra Subprogram (BLAS) library is widely used in many supercomputing applications, and is used to implement more extensive linear algebra subroutine libraries, such as LINPACK and LAPACK. To take advantage of the high degree of parallelism of architectures such as the Fujitsu AP1000, BLAS level 3 routines (matrix-matrix operations) are proposed. This project is concerned wit...
Design and Implementation of Digital Demodulator for Frequency Modulated CW Radar (RESEARCH NOTE)
Radar Signal Processing has been an interesting area of research for realization of programmable digital signal processor using VLSI design techniques. Digital Signal Processing (DSP) algorithms have been an integral design methodology for implementation of high speed application specific real-time systems especially for high resolution radar. CORDIC algorithm, in recent times, is turned out to...
BLASFEO: Basic linear algebra subroutines for embedded optimization
BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization. A key difference with respect to existing high-performance implementations of BLAS is that the computational performance is optimized for small to medium scale matrices, i.e., for sizes up to a few hundred. BLASFEO comes with three different impl...